Employing Latent Dirichlet Allocation Model for Topic Extraction of Chinese Text

Author

  • Qihua Liu
Abstract

A hidden topic model for Chinese text, whose semantics are complex, is urgently needed, as China has taken on an increasingly significant role in globalization in recent years. This paper details the basic process of extracting latent Chinese topics by demonstrating a Chinese topic extraction schema based on the Latent Dirichlet Allocation (LDA) model. The schema was applied to CCL, an authoritative Chinese corpus, to extract topics for its nine categories. Empirical analysis shows that LDA achieves a considerably higher average precision rate than three other comparable Chinese topic extraction techniques; however, its average recall rate is lower than that of KNN and almost the same as that of the PLSI model. Moreover, the recall and precision rates of LDA-CH are lower than those of LDA-EH. Therefore, the LDA model should be adapted to the distinctive features of Chinese words to make it better suited to Chinese topic extraction.
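The abstract compares methods by average precision and recall rates across categories. The paper's exact evaluation protocol is not given here, so the following is only a minimal sketch of how such rates are commonly computed, assuming each category has a set of extracted terms and a human-chosen reference set. The function names, category names, and toy terms are all hypothetical and are not taken from the paper or from CCL.

```python
def precision_recall(extracted, reference):
    """Precision and recall of an extracted term set against a reference set."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)  # terms found in both sets
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def macro_average(per_category):
    """Average (precision, recall) pairs, weighting every category equally."""
    n = len(per_category)
    avg_p = sum(p for p, _ in per_category) / n
    avg_r = sum(r for _, r in per_category) / n
    return avg_p, avg_r


# Hypothetical toy data (assumption): terms a model extracted per category
# versus terms an annotator considers correct for that category.
extracted = {
    "economy": ["市场", "银行", "投资", "天气"],
    "sports": ["比赛", "球队"],
}
reference = {
    "economy": ["市场", "银行", "投资", "贸易", "股票"],
    "sports": ["比赛", "球队", "冠军", "教练"],
}

scores = [precision_recall(extracted[c], reference[c]) for c in extracted]
avg_precision, avg_recall = macro_average(scores)
print(f"average precision = {avg_precision:.3f}, average recall = {avg_recall:.3f}")
```

In this toy case the extracted lists are short and mostly correct, so precision is high while recall is lower. That is the same kind of trade-off the abstract reports for LDA relative to KNN.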


Similar resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...


Chinese Short-Text Classification Based on Topic Model with High-Frequency Feature Expansion

Short text differs from traditional documents in its shortness and sparseness. Feature extension can ease the problem of high sparseness in the vector space model, but it inevitably introduces noise. To resolve this problem, this paper proposes a high-frequency feature expansion method based on a latent Dirichlet allocation (LDA) topic model. High-frequency features are extracted from each cate...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...


Language model adaptation using latent dirichlet allocation and an efficient topic inference algorithm

We present an effort to perform topic mixture-based language model adaptation using latent Dirichlet allocation (LDA). We use probabilistic latent semantic analysis (PLSA) to automatically cluster a heterogeneous training corpus, and train an LDA model using the resultant topic-document assignments. Using this LDA model, we then construct topic-specific corpora at the utterance level for interpol...


Topic Model for Person Identification using Gait Sequence Analysis

Gait sequence analysis from input binary silhouettes has various applications, such as person identification, human action recognition, event recognition and classification. Gait feature extraction is a key step in gait analysis. The 'Topic Model', used for text classification, is one of the potential semantic approaches to studying gait sequence analysis. The proposed algorithm uses Late...



Publication date: 2016